(57) ABSTRACT


The present invention relates to a method and apparatus for 
obtaining complete speech signals for speech recognition 
applications. In one embodiment, the method continuously 
records an audio stream comprising a sequence of frames to a 
circular buffer. When a user command to commence or ter- 
minate speech recognition is received, the method obtains a 
number of frames of the audio stream occurring before or 
after the user command in order to identify an augmented 
audio signal for speech recognition processing. In further 
embodiments, the method analyzes the augmented audio sig- 
nal in order to locate starting and ending speech endpoints 
that bound at least a portion of speech to be processed for 
recognition. At least one of the speech endpoints is located 
using a Hidden Markov Model. 

39 Claims, 5 Drawing Sheets 
METHOD AND APPARATUS FOR
OBTAINING COMPLETE SPEECH SIGNALS 
FOR SPEECH RECOGNITION 
APPLICATIONS 

CROSS REFERENCE TO RELATED 
APPLICATIONS 

This application claims the benefit of U.S. Provisional 
Patent Application No. 60/606,644, filed Sep. 1, 2004 (entitled
“Method and Apparatus for Obtaining Complete
Speech Signals for Speech Recognition Applications”), 
which is herein incorporated by reference in its entirety. 

REFERENCE TO GOVERNMENT FUNDING

This invention was made with Government support under 
contract number DAAH01-00-C-R003, awarded by the Defense
Advanced Research Projects Agency, and under contract number
NAG2-1568, awarded by NASA. The Government has
certain rights in this invention. 

FIELD OF THE INVENTION 

The present invention relates generally to the field of
speech recognition and relates more particularly to methods 
for obtaining speech signals for speech recognition applica- 
tions. 

BACKGROUND OF THE DISCLOSURE

The accuracy of existing speech recognition systems is
often adversely impacted by an inability to obtain a complete
speech signal for processing. For example, imperfect
synchronization between a user’s actual speech signal and the
times at which the user commands the speech recognition
system to listen for the speech signal can cause an incomplete
speech signal to be provided for processing. For instance, a
user may begin speaking before he provides the command to
process his speech (e.g., by pressing a button), or he may
terminate the processing command before he is finished
uttering the speech signal to be processed (e.g., by releasing or
pressing a button). If the speech recognition system does not
“hear” the user’s entire utterance, the results that the speech
recognition system subsequently produces will not be as
accurate as otherwise possible. In open-microphone
applications, audio gaps between two utterances (e.g., due to latency
or other factors) can also produce incomplete results if an
utterance is started during the audio gap.

Poor endpointing (e.g., determining the start and the end of
speech in an audio signal) can also cause incomplete or
inaccurate results to be produced. Good endpointing increases the
accuracy of speech recognition results and reduces speech
recognition system response time by eliminating background
noise, silence, and other non-speech sounds (e.g., breathing,
coughing, and the like) from the audio signal prior to
processing. By contrast, poor endpointing may produce more flawed
speech recognition results or may require the consumption of
additional computational resources in order to process a
speech signal containing extraneous information. Efficient
and reliable endpointing is therefore extremely important in
speech recognition applications.

Conventional endpointing methods typically use short-time
energy or spectral energy features (possibly augmented
with other features such as zero-crossing rate, pitch, or
duration information) in order to determine the start and the end of
speech in a given audio signal. However, such features
become less reliable under conditions of actual use (e.g.,
noisy real-world situations), and some users elect to disable
endpointing capabilities in such situations because they
contribute more to recognition error than to recognition accuracy.

Thus, there is a need in the art for a method and apparatus 
for obtaining complete speech signals for speech recognition 
applications. 

SUMMARY OF THE INVENTION 

In one embodiment, the present invention relates to a 
method and apparatus for obtaining complete speech signals 
for speech recognition applications. In one embodiment, the 
method continuously records an audio stream which is con- 
verted to a sequence of frames of acoustic speech features and 
stored in a circular buffer. When a user command to com- 
mence or terminate speech recognition is received, the 
method obtains a number of frames of the audio stream occur- 
ring before or after the user command in order to identify an 
augmented audio signal for speech recognition processing. 

In further embodiments, the method analyzes the aug- 
mented audio signal in order to locate starting and ending 
speech endpoints that bound at least a portion of speech to be 
processed for recognition. At least one of the speech end- 
points is located using a Hidden Markov Model. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The teachings of the present invention can be readily 
understood by considering the following detailed description 
in conjunction with the accompanying drawings, in which: 

FIG. 1 is a flow diagram illustrating one embodiment of a 
method for speech recognition processing of an augmented 
audio stream, according to the present invention; 

FIG. 2 is a flow diagram illustrating one embodiment of a 
method for performing endpoint searching and speech recog- 
nition processing on an audio signal; 

FIG. 3 is a flow diagram illustrating a first embodiment of 
a method for performing an endpointing search using an 
endpointing HMM, according to the present invention; 

FIG. 4 is a flow diagram illustrating a second embodiment 
of a method for performing an endpointing search using an 
endpointing HMM, according to the present invention; 

FIG. 5 is a high-level block diagram of the present inven- 
tion implemented using a general purpose computing device. 

To facilitate understanding, identical reference numerals 
have been used, where possible, to designate identical ele- 
ments that are common to the figures. 

DETAILED DESCRIPTION 

The present invention relates to a method and apparatus for 
obtaining an improved audio signal for speech recognition 
processing, and to a method and apparatus for improved 
endpointing for speech recognition. In one embodiment, an 
audio stream is recorded continuously by a speech recogni- 
tion system, enabling the speech recognition system to 
retrieve portions of a speech signal that conventional speech 
recognition systems might miss due to user commands that 
are not properly synchronized with user utterances. 

In further embodiments of the invention, one or more Hid- 
den Markov Models (HMMs) are employed to endpoint an 
audio signal in real time in place of a conventional signal 
processing endpointer. Using HMMs for this function 
enables speech start and end detection that is faster and more 
robust to noise than conventional endpointing techniques. 
FIG. 1 is a flow diagram illustrating one embodiment of a
method 100 for speech recognition processing of an aug- 
mented audio stream, according to the present invention. The 
method 100 is initialized at step 102 and proceeds to step 104, 
where the method 100 continuously records an audio stream 
(e.g., a sequence of audio frames containing user speech, 
background audio, etc.) to a circular buffer. In step 106, the 
method 100 receives a user command (e.g., via a button press 
or other means) to commence speech recognition, at time
t=T_S.

In step 108, the user begins speaking, at time t=S. The user 
command to commence speech recognition, received at time 
t=T_S, and the actual start of the user speech, at time t=S, are
only approximately synchronized; the user may begin speak- 
ing before or after the command to commence speech recog- 
nition received in step 106. 

Once the user begins speaking, the method 100 proceeds to
step 110 and requests a portion of the recorded audio stream
from the circular buffer starting at time t=T_S - N_1, where N_1 is
an interval of time such that T_S - N_1 < S ≤ T_S most of the time. In
one embodiment, the interval N_1 is chosen by analyzing real
or simulated user data and selecting the minimum value of N_1
that minimizes the speech recognition error rate on that data.
In some embodiments, a sufficient value for N_1 is in the range
of tenths of a second. In another embodiment, where the audio
signal for speech recognition processing has been acquired
using an open-microphone mode, N_1 is approximately equal
to T_S - T_P, where T_P is the absolute time at which the previous
speech recognition process on the previous utterance ended.
Thus, the current speech recognition process will start on the
first audio frame that was not recognized in the previous
speech recognition processing.
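
By way of illustration only, the selection of the interval N_1 described
above (the minimum value that minimizes the recognition error rate on
real or simulated data) might be sketched as follows. The function name
select_interval, the tolerance parameter, and the measured error rates
are hypothetical and are not taken from the disclosure.

```python
def select_interval(error_rate_by_interval, tolerance=1e-9):
    """Return the smallest candidate interval that achieves the minimum
    error rate.

    error_rate_by_interval maps a candidate interval (in seconds) to the
    recognition error rate measured with that interval on real or
    simulated user data. The tolerance (an assumption of this sketch)
    absorbs floating-point noise when comparing error rates.
    """
    if not error_rate_by_interval:
        raise ValueError("at least one candidate interval is required")
    best = min(error_rate_by_interval.values())
    candidates = [interval
                  for interval, err in error_rate_by_interval.items()
                  if err <= best + tolerance]
    return min(candidates)


# Hypothetical measurements, for illustration only.
measured = {0.1: 0.12, 0.2: 0.09, 0.3: 0.07, 0.4: 0.07, 0.5: 0.07}
n1 = select_interval(measured)
```

In this hypothetical data set the error rate stops improving at 0.3
seconds, so larger intervals would only add audio to be processed.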

In step 112, the method 100 receives a user command (e.g.,
via a button press or other means) to terminate speech
recognition, at time t=T_E. In step 114, the user stops speaking, at
time t=E. The user command to terminate speech recognition,
received at time t=T_E, and the actual end of the user speech, at
time t=E, are only approximately synchronized; the user may
stop speaking before or after the command to terminate
speech recognition received in step 112.

In step 116, the method 100 requests a portion of the audio
stream from the circular buffer up to time t=T_E + N_2, where N_2
is an interval of time such that T_E ≤ E < T_E + N_2 most of the
time. In one embodiment, N_2 is chosen by analyzing real or
simulated user data and selecting the minimum value of N_2
that minimizes the speech recognition error rate on that data.
Thus, an augmented audio signal starting at time T_S - N_1 and
ending at time T_E + N_2 is identified.
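
As a minimal sketch under stated assumptions, the continuous recording
of steps 104 through 116 and the retrieval of the augmented audio signal
might look like the following. Times are expressed as absolute frame
indices rather than seconds, and the class CircularFrameBuffer, the
function augmented_signal, and the clipping behavior are illustrative
assumptions, not details given in the disclosure.

```python
from collections import deque


class CircularFrameBuffer:
    """Fixed-capacity frame store; the oldest frames are overwritten."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)
        self._next_index = 0  # absolute index of the next frame recorded

    def record(self, frame):
        self._frames.append(frame)
        self._next_index += 1

    def get(self, start, end):
        """Return frames with absolute indices in [start, end], clipped
        to the frames still held in the buffer."""
        oldest = self._next_index - len(self._frames)
        lo = max(start, oldest)
        hi = min(end, self._next_index - 1)
        if lo > hi:
            return []
        return [self._frames[i - oldest] for i in range(lo, hi + 1)]


def augmented_signal(buffer, t_s, t_e, n1, n2):
    """Frames from T_S - N_1 through T_E + N_2 (all in frame units)."""
    return buffer.get(t_s - n1, t_e + n2)


# Illustration: each "frame" is just its own index.
buf = CircularFrameBuffer(capacity=50)
for i in range(100):
    buf.record(i)

signal = augmented_signal(buf, t_s=70, t_e=80, n1=5, n2=3)
```

Here the user-designated window covers frames 70 to 80, and the
augmented signal also includes the five frames before and the three
frames after it. In a live system the frames after T_E would only become
available once they have been recorded.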

In step 118 (illustrated in phantom), the method 100 
optionally performs an endpoint search on at least a portion of 
the augmented audio signal. In one embodiment, an endpoint- 
ing search in accordance with step 118 is performed using a 
conventional endpointing technique. In another embodiment, 
an endpointing search in accordance with step 118 is per- 
formed using one or more Hidden Markov Models (HMMs), 
as described in further detail below in connection with FIG. 2.

In step 120, the method 100 applies speech recognition 
processing to the endpointed audio signal. Speech recogni- 
tion processing may be applied in accordance with any known 
speech recognition technique. 

The method 100 then returns to step 104 and continues to 
record the audio stream to the circular buffer. Recording of 
the audio stream to the circular buffer is performed in parallel 
with the speech recognition processes, e.g., steps 106-120 of 
the method 100.

The method 100 affords greater flexibility in choosing
speech signals for recognition processing than conventional
speech recognition techniques. Importantly, the method 100
improves the likelihood that a user’s entire utterance is
provided for recognition processing, even when user operation of
the speech recognition system would normally provide an
incomplete speech signal. Because the method 100 continuously
records the audio stream containing the speech signals,
the method 100 can “back up” or “go forward” to retrieve
portions of a speech signal that conventional speech
recognition systems might miss due to user commands that are not
properly synchronized with user utterances. Thus, more
complete and more accurate speech recognition results are
produced.

Moreover, because the audio stream is continuously
recorded even when speech is not being actively processed,
the method 100 enables new interaction strategies. For
example, speech recognition processing can be applied to an
audio stream immediately upon command, from a specified
point in time (e.g., in the future or recent past), or from a last
detected speech endpoint (e.g., a speech starting or speech
ending point), among other times. Thus, speech recognition
can be performed, on the user’s command, from a frame that
is not necessarily the most recently recorded frame (e.g.,
occurring some time before or after the most recently
recorded frame).

FIG. 2 is a flow diagram illustrating one embodiment of a
method 200 for performing endpoint searching and speech
recognition processing on an audio signal, e.g., in accordance
with steps 118-120 of FIG. 1. The method 200 is initialized at
step 202 and proceeds to step 204, where the method 200
receives an audio signal, e.g., from the method 100.

In step 206, the method 200 performs a speech endpointing
search using an endpointing HMM to detect the start of the
speech in the received audio signal. In one embodiment, the
endpointing HMM recognizes speech and silence in parallel,
enabling the method 200 to hypothesize the start of speech
when speech is more likely than silence. Many topologies can
be used for the speech HMM, and a standard silence HMM
may also be used. In one embodiment, the topology of the
speech HMM is defined as a sequence of one or more reject
“phones”, where a reject phone is an HMM model trained on
all types of speech. In another embodiment, the topology of
the speech HMM is defined as a sequence (or sequence of
loops) of context-independent (CI) or other phones. In further
embodiments, the endpointing HMM has a pre-determined
but configurable minimum duration, which may be a function
of the number of reject or other phones in sequence in the
speech HMM, and which enables the endpointer to more
easily reject short noises as speech.
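
The effect of a minimum duration on rejecting short noises can be
illustrated with a deliberately simplified Viterbi search. This sketch
is not the endpointing HMM of the disclosure: it assumes one silence
state and a left-to-right chain of speech states with uniform transition
costs, and it takes per-frame speech and silence log-likelihoods as
given inputs rather than computing them from trained acoustic models.
The function name and the example scores are hypothetical.

```python
import math


def decode_speech_silence(speech_ll, silence_ll, min_speech_frames=3):
    """Label each frame "speech" or "silence" with a Viterbi search.

    State 0 is silence. States 1..k form a left-to-right speech chain,
    so any completed speech segment lasts at least k frames.
    """
    if not speech_ll:
        return []
    k = min_speech_frames
    n_states = k + 1
    # Allowed predecessors of each state (all transitions cost zero).
    preds = {0: [0, k], 1: [0]}
    for s in range(2, k + 1):
        preds[s] = [s - 1]
    preds[k] = preds[k] + [k]  # the last speech state may repeat

    def emit(state, t):
        return silence_ll[t] if state == 0 else speech_ll[t]

    # Assumption: the signal begins in silence.
    score = [-math.inf] * n_states
    score[0] = emit(0, 0)
    back = []
    for t in range(1, len(speech_ll)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(preds[s], key=lambda p: score[p])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + emit(s, t))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return ["silence" if s == 0 else "speech" for s in path]


# Hypothetical scores: 1 marks a frame that sounds like speech.
# Frame 2 is a one-frame noise burst; frames 6-10 are real speech.
speech_like = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
speech_ll = [-1.0 if x else -5.0 for x in speech_like]
silence_ll = [-5.0 if x else -1.0 for x in speech_like]
labels = decode_speech_silence(speech_ll, silence_ll, min_speech_frames=3)
```

With a minimum of three speech frames, the isolated burst at frame 2 is
labeled silence, because explaining it as speech would force two
silence-like frames through the speech chain at a higher cost.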

In one embodiment, the method 200 identifies the speech
starting frame when it detects a predefined sufficient number
of frames of speech in the audio signal. The number of frames
of speech that are required to indicate a speech endpoint may
be adjusted as appropriate for different speech recognition
applications. Embodiments of methods for implementing an
endpointing HMM in accordance with step 206 are described
in further detail below with reference to FIGS. 3-4.

In step 208, once the speech starting frame, F_SD, is
detected, the method 200 backs up a pre-defined number B of
frames to a frame F_S preceding the speech starting frame F_SD,
such that F_S = F_SD - B becomes the new “start frame” for the
speech for the purposes of the speech recognition process. In
one embodiment, the number B of frames by which the
method 200 backs up is relatively small (e.g., approximately
10 frames), but is large enough to ensure that the speech
recognition process begins on a frame of silence.
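
For illustration, the back-up of step 208 reduces to a single
subtraction. The sketch below adds one assumption that the text does not
state: the result is clipped so that it never precedes the earliest
frame still available. The function and parameter names are
hypothetical.

```python
def recognition_start_frame(f_sd, b=10, first_available=0):
    """Compute F_S = F_SD - B, clipped to the earliest available frame.

    f_sd is the detected speech starting frame and b is the number of
    frames to back up (approximately 10 in one embodiment).
    """
    return max(first_available, f_sd - b)


start = recognition_start_frame(42)      # backs up 10 frames
clipped = recognition_start_frame(4)     # cannot go before frame 0
```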

In step 210, the method 200 commences recognition
processing starting from the new start frame F_S identified in step
208. In one embodiment, recognition processing is performed
in accordance with step 210 using a standard speech
recognition HMM separate from the endpointing HMM.

In step 212, the method 200 detects the end of the speech to 
be processed. In one embodiment, a speech “end frame” is 
detected when the recognition process started in step 210 of 
the method 200 detects a predefined sufficient number of 
frames of silence following frames of speech. In one embodi- 
ment, the number of frames of silence that are required to 
indicate a speech endpoint is adjustable based on the particu- 
lar speech recognition application. In another embodiment, 
the ending/silence frames might be required to legally end the
speech recognition grammar, forcing the endpointer not to 
detect the end of speech until a legal ending point. In another 
embodiment, the speech end frame is detected using the same 
endpointing HMM used to detect the speech start frame. 
Embodiments of methods for implementing an endpointing 
HMM in accordance with step 212 are described in further 
detail below with reference to FIGS. 3-4. 

In step 214, the method 200 terminates speech recognition 
processing and outputs recognized speech, and in step 216, 
the method 200 terminates. 

Implementation of endpointing HMMs in conjunction
with the method 200 enables more accurate detection of 
speech endpoints in an input audio signal, because the method 
200 does not have any internal parameters that directly 
depend on the characteristics of the audio signal and that 
require extensive tuning. Moreover, the method 200 does not 
utilize speech features that are unreliable in noisy environ- 
ments. Furthermore, because the method 200 requires mini- 
mal computation (e.g., processing while detecting the start 
and the end of speech is minimal), speech recognition results 
can be produced more rapidly than is possible by conven- 
tional speech recognition systems. Thus, the method 200 can 
rapidly and reliably endpoint an input speech signal in virtu- 
ally any environment. 

Moreover, implementation of the method 200 in conjunc- 
tion with the method 100 improves the likelihood that a user’s 
complete utterance is provided for speech recognition pro- 
cessing, which ultimately produces more complete and more 
accurate speech recognition results. 

FIG. 3 is a flow diagram illustrating a first embodiment of 
a method 300 for performing an endpointing search using an 
endpointing HMM, according to the present invention. The 
method 300 may be implemented in accordance with step 206 
and/or step 212 of the method 200 to detect endpoints of 
speech in an audio signal received by a speech recognition 
system. 

The method 300 is initialized at step 302 and proceeds to
step 304, where the method 300 counts a number, F_1, of
frames of the received audio signal in which the most likely
word (e.g., according to the standard HMM Viterbi search
criteria) is speech in the last N_1 preceding frames. In one
embodiment, N_1 is a predefined parameter that is
configurable based on the particular speech recognition application
and the desired results. Once the number F_1 of frames is
determined, the method 300 proceeds to step 306 and
determines whether the number F_1 of frames exceeds a first
predefined threshold, T_1. Again, the first predefined threshold,
T_1, is configurable based on the particular speech recognition
application and the desired results.

If the method 300 concludes in step 306 that F_1 does not
exceed T_1, the method 300 proceeds to step 310 and continues
to search the audio signal for a speech endpoint, e.g., by
returning to step 304, incrementing the location in the speech
signal by one frame, and continuing to count the number of
speech frames in the last N_1 frames of the audio signal.
Alternatively, if the method 300 concludes in step 306 that F_1 does
exceed T_1, the method 300 proceeds to step 308 and defines
the first frame F_SD of the frame sequence that includes the
number (F_1) of frames as the speech starting point. The
method 300 then backs up to a predefined number B of frames
before the speech starting frame for speech recognition
processing, e.g., in accordance with step 208 of the method 200.
In one embodiment, values for the parameters N_1 and T_1 are
determined to simultaneously minimize the probability of
detecting short noises as speech and maximize the probability
of detecting single, short words (e.g., “yes” or “no”) as
speech.

In one embodiment, the method 300 may be adapted to
detect the speech stopping frame as well as the speech starting
frame (e.g., in accordance with step 212 of the method 200).
However, in step 304, the method 300 would count the
number, F_2, of frames of the received audio signal in which the
most likely word is silence in the last N_2 preceding frames.
Then, when that number, F_2, meets a second predefined
threshold, T_2, speech recognition processing is terminated
(e.g., effectively identifying the frame at which recognition
processing is terminated as the speech endpoint). In either
case, the method 300 is robust to noise and produces accurate
speech recognition results with minimal computational
complexity.
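
The counting scheme of the method 300 can be sketched as follows, under
the assumption that the Viterbi search has already produced a
most-likely-word label ("speech" or "silence") for each frame. Taking
the first frame of the N_1-frame window as F_SD is one reading of step
308, and the function names and example labels are hypothetical.

```python
from collections import deque


def find_start_frame(labels, n1, t1):
    """Return F_SD once more than t1 of the last n1 frames are speech,
    or None if that never happens."""
    window = deque(maxlen=n1)
    for t, label in enumerate(labels):
        window.append(label == "speech")
        f1 = sum(window)  # F_1: speech frames in the last n1 frames
        if f1 > t1:
            # Assumption: F_SD is the first frame of the current window.
            return t - len(window) + 1
    return None


def find_end_frame(labels, start, n2, t2):
    """Return the frame at which at least t2 of the last n2 frames are
    silence, searching from the given start frame, or None."""
    window = deque(maxlen=n2)
    for t in range(start, len(labels)):
        window.append(labels[t] == "silence")
        if sum(window) >= t2:  # F_2 meets the threshold T_2
            return t
    return None


# Hypothetical labels: a one-frame noise at frame 5, speech at 10-17.
labels = (["silence"] * 5 + ["speech"] + ["silence"] * 4
          + ["speech"] * 8 + ["silence"] * 6)
f_sd = find_start_frame(labels, n1=5, t1=3)
f_end = find_end_frame(labels, start=f_sd, n2=5, t2=4)
```

The single speech-labeled frame at index 5 never brings the count above
the threshold, so it is not reported as a speech start.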

FIG. 4 is a flow diagram illustrating a second embodiment
of a method 400 for performing an endpointing search using
an endpointing HMM, according to the present invention.
Similar to the method 300, the method 400 may be
implemented in accordance with step 206 and/or step 212 of the
method 200 to detect endpoints of speech in an audio signal
received by a speech recognition system.

The method 400 is initialized at step 402 and proceeds to
step 404, where the method 400 identifies the most likely
word in the endpointing search (e.g., in accordance with the
standard Viterbi HMM search algorithm).

In order to determine the speech starting endpoint, in step
406 the method 400 determines whether the most likely word
identified in step 404 is speech or silence. If the method 400
concludes that the most likely word is speech, the method 400
proceeds to step 408 and computes the duration, D_S, back to
the most recent pause-to-speech transition.

In step 410, the method 400 determines whether the
duration D_S meets or exceeds a first predefined threshold T_1. If the
method 400 concludes that the duration D_S does not meet or
exceed T_1, then the method 400 determines that the identified
most likely word does not represent a starting endpoint of the
speech, and the method 400 processes the next audio frame
and returns to step 404 to continue the search for a starting
endpoint.

Alternatively, if the method 400 concludes in step 410 that
the duration D_S does meet or exceed T_1, then the method 400
proceeds to step 412 and identifies the first frame F_SD of the
most likely speech word identified in step 404 as a speech
starting endpoint. Note that according to step 208 of the
method 200, speech recognition processing will start some
number B of frames before the speech starting point identified
in step 404 of the method 400 at frame F_S = F_SD - B. The
method 400 then terminates in step 422.

To determine the speech ending endpoint, referring back to
step 406, if the method 400 concludes that the most likely
word identified in step 404 is not speech (i.e., is silence), the
method 400 proceeds to step 414, where the method 400
confirms that the frame(s) in which the most likely word
appears is subsequent to the frame representing the speech
starting point. If the method 400 concludes that the frame in
which the most likely word appears is not subsequent to the
frame of the speech starting point, then the method 400
concludes that the most likely word identified in step 404 is not a
speech endpoint and returns to step 404 to process the next
audio frame and continue the search for a speech endpoint.

US 7,610,199 B2

Alternatively, if the method 400 concludes in step 414 that
the frame in which the most likely word appears is subsequent
to the frame of the speech starting point, the method 400
proceeds to step 416 and computes the duration, D_P, back to
the most recent speech-to-pause transition.

In step 418, the method 400 determines whether the
duration, D_P, meets or exceeds a second predefined threshold T_2.
If the method 400 concludes that the duration D_P does not
meet or exceed T_2, then the method 400 determines that the
identified most likely word does not represent an endpoint of
the speech, and the method 400 processes the next audio
frame and returns to step 404 to continue the search for an
ending endpoint.

However, if the method 400 concludes in step 418 that the
duration D_P does meet or exceed T_2, then the method 400
proceeds to step 420 and identifies the most likely word
identified in step 404 as a speech endpoint (specifically, as a
speech ending endpoint). The method 400 then terminates in
step 422.
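
A minimal sketch of the duration test of the method 400 follows, again
assuming that a per-frame most-likely-word label is already available
from the Viterbi search. Reporting the first frame of the qualifying
pause as the ending endpoint is an assumption of this sketch, and the
function name and example labels are hypothetical.

```python
def find_endpoints(labels, t1, t2):
    """Return (start, end) frame indices, or None for either endpoint
    that is not found.

    D_S and D_P are measured as the length of the current run of
    identical labels, i.e., the duration back to the most recent
    pause-to-speech or speech-to-pause transition.
    """
    start = end = None
    run_label, run_start = None, 0
    for t, label in enumerate(labels):
        if label != run_label:  # a transition occurred at this frame
            run_label, run_start = label, t
        duration = t - run_start + 1
        if start is None:
            if label == "speech" and duration >= t1:
                start = run_start  # first frame of the speech word
        elif label == "silence" and run_start > start and duration >= t2:
            end = run_start  # assumption: first frame of the pause
            break
    return start, end


# Hypothetical labels: a one-frame noise at frame 5, speech at 10-17.
labels = (["silence"] * 5 + ["speech"] + ["silence"] * 4
          + ["speech"] * 8 + ["silence"] * 6)
endpoints = find_endpoints(labels, t1=4, t2=4)
```

Because the duration is measured back to the last transition, the
isolated speech-labeled frame at index 5 has a duration of one frame and
is rejected.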

The method 400 produces accurate speech recognition
results in a manner that is more robust to noise, but more
computationally complex than the method 300. Thus, the
method 400 may be implemented in cases where greater noise
robustness is desired and the additional computational
complexity is less of a concern. The method 300 may be
implemented in cases where it is not feasible to determine the
duration back to the most recent pause-to-speech or
speech-to-pause transition (e.g., when backtrace information is
limited due to memory constraints).

In one embodiment, when determining the speech ending
frame in step 418 of the method 400, an additional
requirement that the speech ending word legally ends the speech
recognition grammar can prevent premature speech endpoint
detection when a user utters a long pause in the middle of an
utterance.

FIG. 5 is a high-level block diagram of the present
invention implemented using a general purpose computing device
500. It should be understood that the digital scheduling
engine, manager or application (e.g., for endpointing audio
signals for speech recognition) can be implemented as a
physical device or subsystem that is coupled to a processor
through a communication channel. Therefore, in one
embodiment, a general purpose computing device 500 comprises a
processor 502, a memory 504, a speech endpointer or module
505 and various input/output (I/O) devices 506 such as a
display, a keyboard, a mouse, a modem, and the like. In one
embodiment, at least one I/O device is a storage device (e.g.,
a disk drive, an optical disk drive, a floppy disk drive).

Alternatively, the digital scheduling engine, manager or
application (e.g., speech endpointer 505) can be represented
by one or more software applications (or even a combination
of software and hardware, e.g., using Application Specific
Integrated Circuits (ASIC)), where the software is loaded
from a storage medium (e.g., I/O devices 506) and operated
by the processor 502 in the memory 504 of the general
purpose computing device 500. Thus, in one embodiment, the
speech endpointer 505 for endpointing audio signals
described herein with reference to the preceding Figures can
be stored on a computer readable medium or carrier (e.g.,
RAM, magnetic or optical drive or diskette, and the like).

The endpointing methods of the present invention may also
be easily implemented in a variety of existing speech
recognition systems, including systems using “hold-to-talk”,
“push-to-talk”, “open microphone”, “barge-in” and other
audio acquisition techniques. Moreover, the simplicity of the
endpointing methods enables the endpointing methods to
automatically take advantage of improvements to a speech
recognition system’s acoustic speech features or acoustic
models with little or no modification to the endpointing
methods themselves. For example, upgrades or improvements to
the noise robustness of the system’s speech features or
acoustic models correspondingly improve the noise robustness of
the endpointing methods employed.

Thus, the present invention represents a significant 
advancement in the field of speech recognition. One or more
Hidden Markov Models are implemented to endpoint (poten- 
tially augmented) audio signals for speech recognition pro- 
cessing, resulting in an endpointing method that is more 
efficient, more robust to noise and more reliable than existing 
endpointing methods. The method is more accurate and less 
computationally complex than conventional methods, mak- 
ing it especially useful for speech recognition applications in 
which input audio signals may contain background noise 
and/or other non-speech sounds.

Although various embodiments which incorporate the 
teachings of the present invention have been shown and 
described in detail herein, those skilled in the art can readily 
devise many other varied embodiments that still incorporate 
these teachings. 

What is claimed is: 

1. A method for recognizing speech in an audio stream 
comprising a sequence of audio frames, the method compris- 
ing the steps of: 

continuously recording said audio stream to a buffer; 

receiving a command to recognize speech in a first portion 
of said audio stream, where said first portion of said 
audio stream occurs between a user-designated start 
point and a user-designated end point, and where said 
command is distinct from said audio stream; 

augmenting said first portion of said audio stream with one 
or more audio frames of said audio stream that do not 
occur between said user-designated start point and said 
user-designated end point to form an augmented audio 
signal; and 

outputting a recognized speech in accordance with said 
augmented audio signal. 

2. The method of claim 1, wherein said augmenting step 
comprises: 

detecting a speech starting point in said audio stream at 
which a speech signal including said first portion of said 
audio stream actually starts; and 

augmenting said speech signal with one or more audio 
frames immediately preceding said user-designated start 
point to form said augmented audio signal. 

3. The method of claim 2, wherein said augmented audio 
signal begins at an audio frame that occurs before said speech 
starting point, and said speech starting point occurs at or 
before said user-designated start point. 

4. The method of claim 1, wherein said augmenting step 
comprises: 

detecting a speech ending point in said audio stream at 
which a speech signal including said first portion of said 
audio stream actually ends; 

augmenting said speech signal with one or more audio 
frames immediately following said user-designated end 
point to form said augmented audio signal. 

5. The method of claim 4, wherein said augmented audio 
signal ends at an audio frame that occurs after said speech 
ending point, and said speech ending point occurs at or after 
said user-designated end point. 
6. The method of claim 1, further comprising the steps of:
performing an endpointing search on said augmented
audio signal; and

applying speech recognition processing to the endpointed 
audio signal. 

7. The method of claim 6, wherein said endpointing search 
comprises the steps of: 

locating at least a first speech endpoint in said audio signal 
using a first Hidden Markov Model; and 
locating a second speech endpoint in said audio signal, 
such that at least a portion of said audio signal located 
between said first speech endpoint and said second 
speech endpoint represents speech. 

8. The method of claim 7, wherein said second speech 
endpoint is located using said first Hidden Markov Model. 

9. The method of claim 7, wherein said first speech end- 
point is a speech starting point represented by a first frame of 
said audio signal and said second speech endpoint is a speech 
ending point represented by a second frame of said audio 
signal, said second frame occurring subsequent to said first 
frame. 

10. The method of claim 9, further comprising the step of: 
backing up a pre-defined number of frames to a third frame
of said audio signal that precedes said first frame; and
performing speech recognition processing on at least a 
portion of said audio signal located between said third 
speech endpoint and said second speech endpoint. 
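
By way of illustration only, the backing-up step of claim 10 might be sketched as below. The function name `recognition_span` and the parameter `backup` are hypothetical; the sketch assumes frames are addressed by integer index and clamps the third frame to the start of the available audio.

```python
def recognition_span(frames, first, second, backup):
    """Back up `backup` frames from `first` and return the span to recognize.

    Returns the index of the third frame together with the frames from
    the third frame through the second frame, inclusive.
    """
    third = max(0, first - backup)  # never back up past the first stored frame
    return third, frames[third:second + 1]


frames = list(range(20))  # stand-in for 20 audio frames
third, span = recognition_span(frames, first=6, second=12, backup=4)
```

Backing up gives the recognizer a margin ahead of the detected starting point, so that a low-energy speech onset missed by the endpointer is still passed to recognition.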

11. The method of claim 10, wherein said speech recogni- 
tion processing is performed using a second Hidden Markov 
Model. 

12. The method of claim 10, wherein said step of locating 
at least a first speech endpoint comprises: 

counting a number of frames of said audio signal for which 
a most likely word in a pre-defined quantity of preceding 
frames is speech; 

determining whether said number of frames exceeds a first 
pre-defined threshold; and 

identifying a starting frame of said number of frames as a 
speech starting point, if said number of frames exceeds 
said first pre-defined threshold. 

13. The method of claim 9, wherein said step of locating a 
second speech endpoint comprises: 

counting a number of frames of said audio signal for which 
a most likely word in a pre-defined quantity of preceding 
frames is silence; 

determining whether said number of frames exceeds a 
second pre-defined threshold; and 
identifying a starting frame of said number of frames as a 
speech ending point, if said number of frames exceeds 
said first pre-defined threshold. 
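
By way of illustration only, the frame-counting approach of claims 12 and 13 might be sketched as follows. The sketch assumes each frame has already been labeled with its most likely word ("speech" or "silence") by a separate decoding step that is not shown; the function names and threshold values are hypothetical.

```python
def find_run_start(labels, target, threshold, begin=0):
    """Return the first frame of the first run of `target` labels whose
    length exceeds `threshold`, searching from `begin`; None if absent."""
    run_start = None
    count = 0
    for i in range(begin, len(labels)):
        if labels[i] == target:
            if count == 0:
                run_start = i
            count += 1
            if count > threshold:
                return run_start
        else:
            count = 0  # the run was interrupted, so start counting again
    return None


def endpoints(labels, speech_threshold, silence_threshold):
    """Locate the speech starting point, then the speech ending point."""
    start = find_run_start(labels, "speech", speech_threshold)
    if start is None:
        return None, None
    end = find_run_start(labels, "silence", silence_threshold, begin=start)
    return start, end


labels = (["silence"] * 3 + ["speech"] * 1 + ["silence"] * 2
          + ["speech"] * 6 + ["silence"] * 5)
start, end = endpoints(labels, speech_threshold=3, silence_threshold=3)
```

In this example the isolated speech frame at index 3 is too short to exceed the threshold, so the starting point is placed at the first frame of the longer run.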

14. The method of claim 7, wherein said step of locating at 
least a first speech endpoint comprises: 

identifying a most likely word in said audio signal; and 
determining whether a duration of said most likely word is 
long enough to indicate that said most likely word rep- 
resents said first speech endpoint. 

15. The method of claim 14, wherein said identifying step 
comprises: 

recognizing said most likely word as either speech or 
silence. 

16. The method of claim 14, wherein said determining step 
comprises: 

computing said most likely word’s duration back to a most
recent pause-to-speech transition in said audio signal, if
said most likely word is speech; and
identifying said most likely word as a speech starting point
if said duration meets or exceeds a first pre-defined 
threshold. 

17. The method of claim 14, wherein said determining step
comprises:

computing said most likely word’s duration back to a most 
recent speech-to-pause transition in said audio signal, if 
said most likely word is silence; 
verifying that an audio signal frame containing said most
likely word is subsequent to an audio signal frame
containing a speech starting point; and
identifying said most likely word as a speech ending point 
if said duration meets or exceeds a second pre-defined 
threshold. 
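
By way of illustration only, the duration test of claims 14 through 17 might be sketched as below. The sketch assumes per-frame labels of "speech" or "silence" are already available and measures duration in frames; the function names and the thresholds `min_speech` and `min_silence` are hypothetical.

```python
def duration_back(labels, t):
    """Count the consecutive frames ending at frame t that share labels[t],
    i.e. the duration back to the most recent transition."""
    n = 0
    while t - n >= 0 and labels[t - n] == labels[t]:
        n += 1
    return n


def detect(labels, min_speech, min_silence):
    """Return (speech starting point, speech ending point) as frame indices."""
    start = end = None
    for t, word in enumerate(labels):
        d = duration_back(labels, t)
        if start is None:
            # Speech lasting long enough marks the starting point.
            if word == "speech" and d >= min_speech:
                start = t - d + 1
        elif word == "silence" and d >= min_silence:
            transition = t - d + 1
            # The ending point must follow the starting point.
            if transition > start:
                end = transition
                break
    return start, end


labels = (["silence"] * 2 + ["speech"] * 4 + ["silence"] * 1
          + ["speech"] * 3 + ["silence"] * 4)
result = detect(labels, min_speech=3, min_silence=3)
```

The single silent frame at index 6 is shorter than the silence threshold, so it is treated as a pause within the utterance rather than as its end.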

18. The method of claim 14, wherein the step of identifying
a most likely word comprises:

identifying a most likely stopping word for speech in said 
audio signal, where said most likely stopping word rep- 
resents a potential speech ending point; and 

selecting a predecessor word of said most likely stopping
word as said most likely word in said audio signal. 

19. The method of claim 7, wherein said endpointing 
search is improved by improving at least one acoustic model 
implemented therein. 

20. The method of claim 1, further comprising:

receiving a command to recognize speech starting from a 
specific frame in said audio stream, where said specific 
frame is recorded some time before or after a most 
recently recorded frame. 

21. A computer readable storage medium containing an
executable program for recognizing speech in an audio
stream comprising a sequence of audio frames, where the 
program performs the steps of: 

continuously recording said audio stream to a buffer; 

receiving a command to recognize speech in a first portion
of said audio stream, where said first portion of said 
audio stream occurs between a user-designated start 
point and a user-designated end point, and where said 
command is distinct from said audio stream; 

augmenting said first portion of said audio stream with one
or more audio frames of said audio stream that do not 
occur between said user-designated start point and said 
user-designated end point to form an augmented audio signal;
and 

outputting a recognized speech in accordance with said
augmented audio signal. 

22. The computer readable storage medium of claim 21, 
wherein said augmenting step comprises: 

detecting a speech starting point in said audio stream at
which a speech signal including said first portion of said
audio stream actually starts; and 
augmenting said speech signal with one or more audio 
frames immediately preceding said user-designated start 
point to form said augmented audio signal. 

23. The computer readable storage medium of claim 22,
wherein said augmented audio signal begins at an audio frame
that occurs before said speech starting point, and said speech 
starting point occurs at or before said user-designated start 
point. 

24. The computer readable storage medium of claim 21,
wherein said augmenting step comprises:

detecting a speech ending point in said audio stream at 
which a speech signal including said first portion of said 
audio stream actually ends; 

augmenting said speech signal with one or more audio
frames immediately following said user-designated end 
point to form said augmented audio signal. 

25. The computer readable storage medium of claim 24,
wherein said augmented audio signal ends at an audio frame 
that occurs after said speech ending point, and said speech 
ending point occurs at or after said user-designated end point.

26. The computer readable storage medium of claim 21,
further comprising the steps of: 

performing an endpointing search on said augmented 
audio signal; and 

applying speech recognition processing to the endpointed 
audio signal.

27. The computer readable storage medium of claim 26, 
wherein said endpointing search comprises the steps of: 

locating at least a first speech endpoint in said audio signal 
using a first Hidden Markov Model; and 
locating a second speech endpoint in said audio signal,
such that at least a portion of said audio signal located 
between said first speech endpoint and said second 
speech endpoint represents speech. 

28. The computer readable storage medium of claim 27, 
wherein said second speech endpoint is located using said
first Hidden Markov Model. 

29. The computer readable storage medium of claim 27, 
wherein said first speech endpoint is a speech starting point 
represented by a first frame of said audio signal and said 
second speech endpoint is a speech ending point represented
by a second frame of said audio signal, said second frame 
occurring subsequent to said first frame. 

30. The computer readable storage medium of claim 29, 
further comprising the step of: 

backing up a pre-defined number of frames to a third frame
of said audio signal that precedes said first frame; and 
performing speech recognition processing on at least a 
portion of said audio signal located between said third 
speech endpoint and said second speech endpoint. 

31. The computer readable storage medium of claim 30,
wherein said speech recognition processing is performed 
using a second Hidden Markov Model. 

32. The computer readable storage medium of claim 29,
wherein said step of locating at least a first speech endpoint
comprises:

counting a number of frames of said audio signal for which 
a most likely word in a pre-defined quantity of preceding 
frames is speech; 

determining whether said number of frames exceeds a first 
pre-defined threshold; and

identifying a starting frame of said number of frames as a 
speech starting point, if said number of frames exceeds 
said first pre-defined threshold. 

33. The computer readable storage medium of claim 29, 
wherein said step of locating a second speech endpoint
comprises:

counting a number of frames of said audio signal for which 
a most likely word in a pre-defined quantity of preceding 
frames is silence; 

determining whether said number of frames exceeds a
second pre-defined threshold; and 
identifying a starting frame of said number of frames as a 
speech ending point, if said number of frames exceeds 
said first pre-defined threshold. 


34. The computer readable storage medium of claim 27, 
wherein said step of locating at least a first speech endpoint 
comprises: 

identifying a most likely word in said audio signal; and 
determining whether a duration of said most likely word is 
long enough to indicate that said most likely word rep- 
resents said first speech endpoint. 

35. The computer readable storage medium of claim 34, 
wherein said identifying step comprises: 

recognizing said most likely word as either speech or 
silence. 

36. The computer readable storage medium of claim 34, 
wherein said determining step comprises: 

computing said most likely word’s duration back to a most 
recent pause-to-speech transition in said audio signal, if
said most likely word is speech; and 
identifying said most likely word as a speech starting point 
if said duration meets or exceeds a first pre-defined 
threshold. 

37. The computer readable storage medium of claim 34, 
wherein said determining step comprises: 

computing said most likely word’s duration back to a most 
recent speech-to-pause transition in said audio signal, if 
said most likely word is silence; 
verifying that an audio signal frame containing said most 
likely word is subsequent to an audio signal frame con- 
taining a speech starting point; and 
identifying said most likely word as a speech ending point 
if said duration meets or exceeds a second pre-defined 
threshold. 

38. The computer readable storage medium of claim 34, 
wherein the step of identifying a most likely word comprises: 

identifying a most likely stopping word for speech in said 
audio signal, where said most likely stopping word rep- 
resents a potential speech ending point; and 
selecting a predecessor word of said most likely stopping 
word as said most likely word in said audio signal. 

39. Apparatus for recognizing speech in an audio stream 
comprising a sequence of audio frames, the apparatus com- 
prising: 

recording means for continuously recording said audio 
stream to a buffer; 

receiving means for receiving a command to recognize 
speech in a first portion of said audio stream, where said 
first portion of said audio stream occurs between a user- 
designated start point and a user-designated end point, 
and where said command is distinct from said audio 
stream; 

augmenting means for augmenting said first portion of said 
audio stream with one or more audio frames of said 
audio stream that do not occur between said user-desig- 
nated start point and said user-designated end point to 
form an augmented audio signal; and 
output means for outputting a recognized speech in accor- 
dance with said augmented audio signal. 



